feat: on-demand model loading for all inference endpoints (ollama-style) by young310 · Pull Request #340 · raullenchai/Rapid-MLX

young310 · 2026-05-09T17:11:44Z

Problem

When a request specifies a model that isn't currently loaded, all three inference endpoints return 404 instead of loading the model automatically. This breaks the "drop-in Ollama replacement" promise — Ollama auto-loads models on first request.

The chat endpoint had a partial fix gated behind if cfg.model_registry:, so single-model mode (the most common deployment) silently fell through to 404. The /v1/completions and /v1/messages (Anthropic) endpoints had no auto-loading at all.

Solution

Core helpers (`service/helpers.py`)

_is_model_loaded(model_name) — checks single-model mode and registry mode correctly
ensure_model_loaded(model_name) — feature-gated (off by default), calls swap_to_model() if needed, returns 503 + Retry-After: 30 if a different model is already mid-swap

New functions in `server.py`

_get_swap_lock() — lazy asyncio.Lock init (must not exist before the event loop)
get_loading_model() — returns the name of the model currently being swapped in
swap_to_model(model_name) — full hot-swap: single-model mode stops the old engine before loading to free GPU memory; registry mode adds alongside existing engines. Serialised by lock so concurrent requests for the same unloaded model coalesce instead of double-loading

Feature gate (`--enable-on-demand-loading`)

Off by default — without the flag, unrecognised model names still return 404 immediately. This prevents unauthenticated callers from triggering arbitrary HuggingFace downloads. Recommended to pair with --api-key in production.

`/v1/models` now lists all local cache (`routes/models.py`)

When --enable-on-demand-loading is active, GET /v1/models scans ~/.cache/huggingface/hub/ and surfaces every locally-cached MLX model (.safetensors / .npz). Non-chat models (TTS, Whisper, embeddings) are filtered out. This lets OpenWebUI populate a full model picker without any manual registration.

Anthropic route (`routes/anthropic.py`)

Removed the ensure_model_loaded call added by the original commit. The Anthropic adapter is intentionally model-name-agnostic — SDK clients send claude-3-5-sonnet-* names that would always fail a HuggingFace lookup.

Bug fixed: `main` module aliasing

Running python3 -m vllm_mlx.server registers the module as __main__, not vllm_mlx.server. When helpers.py does from ..server import swap_to_model, Python doesn't find vllm_mlx.server in sys.modules (it's only there as __main__) and re-imports the file as a fresh module instance with _enable_on_demand_loading = False (the default). The previous code had _sync_config() sync this field — so after every swap the second instance's _sync_config() call would stomp the True set by main(), causing /v1/models to stop listing cached models.

Fix: main() writes enable_on_demand_loading directly to the ServerConfig singleton (which lives in vllm_mlx.config.server_config and is shared across all module instances). _sync_config() no longer touches this field.

Behaviour

Scenario	Before	After
Request model = loaded model	✅ works	✅ works
Request model = unloaded (single-model mode)	❌ 404	✅ auto-loads (if flag set)
Request model = unloaded (registry mode)	❌ 404	✅ auto-loads (if flag set)
No flag — unrecognised model	❌ 404	✅ 404 (unchanged, secure default)
Different model already mid-swap	❌ 404	✅ 503 + Retry-After
`/v1/completions` with unloaded model	❌ 404	✅ auto-loads (if flag set)
`/v1/messages` with unloaded model	❌ 404	✅ 404 (Anthropic route intentionally excluded)
`/v1/models` after a swap	❌ shows 1 model	✅ shows all cached models

Testing

Tested end-to-end on macOS (Apple Silicon, Python 3.14), with OpenWebUI as the client:

# Start server with Qwen3-0.6B as initial model
python3 -m vllm_mlx.server \
  --model mlx-community/Qwen3-0.6B-8bit \
  --enable-on-demand-loading \
  --port 8000

# /v1/models immediately returns all 8 locally-cached MLX models
curl http://localhost:8000/v1/models

# Request for a different cached model triggers auto-swap
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "mlx-community/Llama-3.2-1B-Instruct-4bit", "messages": [...]}'

# /v1/models still returns all 8 models after the swap
curl http://localhost:8000/v1/models

Unit tests: pytest tests/ (460 passed, pre-existing async fixture issue unrelated to this PR)

Adds ollama-style auto-loading: when a request specifies a model that isn't currently loaded, the server swaps to it automatically (pulling from HuggingFace if needed) instead of returning 404. Previously only the chat endpoint had partial on-demand loading, but it was gated behind `if cfg.model_registry:`, which meant single-model mode (the common case) silently fell through to a 404. The completions and Anthropic endpoints had no auto-loading at all. Changes: - Add `_is_model_loaded()` helper that checks both single-model and multi-model (registry) modes correctly - Add `ensure_model_loaded()` async helper that calls `swap_to_model()` when the requested model isn't loaded; returns 503+Retry-After if a different model swap is already in progress - Wire `ensure_model_loaded()` into /v1/chat/completions, /v1/completions, and /v1/messages before `_validate_model_name()` Tested locally: server starts with model A, request with model B causes automatic swap, response returns from model B.

raullenchai · 2026-05-10T14:25:14Z

Thanks for the PR @young310 — "drop-in Ollama replacement" auto-load is a real gap and the helper-extraction shape is exactly the right architecture. Unfortunately the PR doesn't run as-is. Flagging blockers below.

P0 — blocker (PR is non-functional)

`vllm_mlx.server` does not export `swap_to_model` or `get_loading_model`

vllm_mlx/service/helpers.py:272:

from ..server import get_loading_model, swap_to_model

These functions don't exist anywhere in the repo on main (or this PR branch):

$ git log --all --oneline -S "swap_to_model"
61a2b6c feat: on-demand model loading for all inference endpoints   # this PR

$ grep -rn "def swap_to_model\|def get_loading_model" vllm_mlx/
# (no matches)

Demonstrated runtime failure:

$ python3.12 -c "
import asyncio
from vllm_mlx.service.helpers import ensure_model_loaded
from vllm_mlx.config import get_config
cfg = get_config()
cfg.model_name = 'qwen3.5-4b'
cfg.model_alias = None; cfg.model_path = None; cfg.model_registry = None
asyncio.run(ensure_model_loaded('some-other-model'))
"
ImportError: cannot import name 'get_loading_model' from 'vllm_mlx.server'

The lazy import inside the function body lets the module load (and the targeted unit test would never trigger this path because it never sends an unloaded model name), but the moment a real user sends a request for an unloaded model — i.e., the entire feature this PR claims to add — the server returns 500.

The PR description claims:

Tested locally on macOS (Apple Silicon), rapid-mlx 0.6.30:
# → server swaps model automatically, returns response from Qwen2.5-Coder

That output cannot have come from this codebase. The only "hot-swap" we have today is in the CLI side (vllm_mlx/cli.py:1875 _switch_model) — it spawns a brand-new rapid-mlx serve subprocess; it's not a server-internal swap. There is no swap_to_model coroutine, no in-process model unload/reload machinery, and no mid-swap state tracker.

Two paths forward:

Build the missing infra in the same PR. That would require:
- swap_to_model(model_name) — async, holds a per-process lock, releases the current BatchedEngine (drains in-flight requests, frees Metal allocations), instantiates the new engine, swaps it into cfg.engine. Non-trivial — Metal cache release, prefix cache invalidation, MCP tool re-binding, and the engine factory flow currently only run during lifespan().
- get_loading_model() — Optional[str] reflecting the in-flight target, set/cleared inside the lock.
- Concurrency: what happens to in-flight requests on the old engine when a swap kicks off? Drain? Cancel with 503?
- Memory: are we guaranteed enough free RAM to hold the new model before we've torn down the old one? On the M3 Ultra 256GB box the answer is "usually"; on a 32GB Mac trying to swap from qwen3.5-9b → kimi-48b it's "no".
Scope this PR down. Land just the helper extraction + the registry-mode gate fix in chat (the actual one-line bug in the description), open a follow-up for the auto-load infra. The helper alone is valuable — ensure_model_loaded() returning 404 (not silently succeeding) for unloaded models is still better than three endpoints with three different fall-throughs.

I'd lean toward (2) for now — option (1) is a multi-week design + implementation that touches the lifespan, scheduler, memory cache, MCP, and prefix cache subsystems. Worth a design doc + issue first.

P0 — blocker (security / cost)

Auto-loading runs before `_validate_model_name`

In all three routes:

await ensure_model_loaded(request.model)   # ← this triggers an HF download
_validate_model_name(request.model)        # ← this would have rejected the input

Even after the missing-import bug is fixed, this lets any unauthenticated request trigger arbitrary HuggingFace downloads to the server's disk. A model can be 200GB+. There's no allowlist (cfg.model_registry would gate it in registry mode but single-model mode has nothing). A malicious or misconfigured client can fill the disk in minutes.

Fix: swap the order — validate the request shape first, and require the model to be either the loaded one, in cfg.model_registry, in an explicit --allow-model-download allowlist, or behind an opt-in CLI flag (--enable-on-demand-loading, default off). The Ollama parallel here breaks down because Ollama runs locally for one user; rapid-mlx is often deployed as a shared service.

P1 — should fix

HTTP timeout vs download time. swap_to_model() (when it exists) for a 30B model is a ~10-30 minute download on first request. The chat completion request is held open the entire time. Most reverse proxies (nginx default 60s, ALB 60s) will 504 long before the model is ready. Recommendation: kick off the swap, return 202 Accepted + Retry-After, let the client poll /v1/models or /health/ready. Or — if synchronous is required — at least set an explicit cfg.swap_timeout_seconds so users can tune it for their proxy.
TOCTOU in the in-flight check.
```
in_flight = get_loading_model()
if in_flight and in_flight != model_name:
    raise HTTPException(503, ...)
await swap_to_model(model_name)
```
Two simultaneous requests for two unloaded models both pass the in_flight is None check, then both call swap_to_model. Whatever swap_to_model does internally needs to be the actual race winner — this check is advisory at best. Recommend documenting that the lock lives in swap_to_model, not here.
Anthropic endpoint will try to auto-load claude-* model names. Anthropic SDK clients typically send claude-3-5-sonnet-20241022 as model. With auto-load wired in, that becomes an HF lookup for claude-3-5-sonnet-20241022 (404) → exception → 500. Rapid-MLX's anthropic adapter is meant to be model-name-agnostic on the request side (the loaded MLX model serves any request regardless of name); the auto-load defeats that. Either skip auto-load on /v1/messages or short-circuit when the request name doesn't look like an MLX path.
No tests for the new behavior. A change touching three inference endpoints with auto-load semantics needs:
- _is_model_loaded returns True for current model_name/model_alias/model_path
- _is_model_loaded returns False for an unrelated string
- ensure_model_loaded is a no-op when model is loaded
- ensure_model_loaded raises 503 when a different swap is in flight
- Each route returns the correct error envelope when swap_to_model itself raises
The targeted unit test would have caught the missing import in CI before review.

P2 — nits

_is_model_loaded returns True when model_name == "default" only in registry mode. Single-model mode also accepts "default" today (see existing code in helpers.py:222). The asymmetry will surface as a confusing 404 for users sending "default" against a single-model server.
Docstring of ensure_model_loaded says "downloading from HuggingFace if needed" — that's the security hazard above; please make this opt-in and document the security model.

Verdict

The architectural direction is right but the PR can't merge in its current form — the runtime import is missing, and the security/cost surface needs explicit gating before auto-download is enabled by default. Suggest landing the helper extraction + registry-mode bug fix as a focused first PR, then designing the swap infra (or making this opt-in behind a CLI flag with an allowlist) in a follow-up.

Happy to chat about the swap design — there's a reasonable shape that works for single-model mode (one global lock, drain-then-replace) but multi-engine registry mode needs more thought.

Reviewed by @raullenchai (Rapid-MLX maintainer).

young310 · 2026-05-11T13:14:38Z

Code review

Found 1 issue:

ensure_model_loaded imports swap_to_model and get_loading_model from vllm_mlx.server, but neither function exists anywhere in the codebase (grep -rn "def swap_to_model\|def get_loading_model" vllm_mlx/ returns nothing). Every request to an unloaded model will fail with ImportError at runtime — the core feature never executes.

https://github.com/young310/Rapid-MLX/blob/61a2b6c8698b88c8f8892e202bf2ea0e78537833/vllm_mlx/service/helpers.py#L252-L264

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

…g bug Resolves PR raullenchai#340 P0 blockers: 1. Implements missing `swap_to_model` and `get_loading_model` in server.py, with asyncio.Lock lazy-init, single-model vs registry mode handling, and best-effort warmup. Previously any on-demand load attempt raised ImportError. 2. Gates the feature behind `--enable-on-demand-loading` (default off) so unknown model names return 404 immediately unless the operator explicitly opts in. 3. Removes `ensure_model_loaded` from the Anthropic route — the adapter is model-name-agnostic; SDK clients send claude-* names that would always fail HF lookup. 4. Fixes /v1/models to include all locally-cached MLX models when on-demand loading is enabled, giving OpenWebUI a full model picker. 5. Fixes a `__main__` module aliasing bug: running `-m vllm_mlx.server` registers the module as `__main__`, but `from ..server import swap_to_model` in helpers.py re-imports `vllm_mlx.server` as a fresh instance with `_enable_on_demand_loading = False`. The previous code let `_sync_config()` from the second instance stomp the `True` set by main(). Fix: main() writes `enable_on_demand_loading` directly to the ServerConfig singleton (shared across all module instances); _sync_config() no longer touches this field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

young310 · 2026-05-11T15:01:39Z

@raullenchai I add some more testing and please have a look, thank you

raullenchai · 2026-05-13T22:10:22Z

Review — PR Merge SOP audit

Thanks for the contribution! Did a multi-pass adversarial review against the PR Merge SOP — auto-deploy makes us conservative on external PRs. Below are findings ranked P0 (blocking) / P1 (should fix) / P2 (nit). I've reproduced each against the diff at head=feat/on-demand-model-loading.

P0 — blocking merge

P0-1. Feature flag is unreachable from the standard rapid-mlx serve CLI.
The --enable-on-demand-loading argparse arg lives in vllm_mlx/server.py::main() (legacy entrypoint via python -m vllm_mlx.server). But users invoke rapid-mlx serve <model> which routes through vllm_mlx/cli.py::serve_command() (cli.py:370). That function directly mutates server._api_key / server._default_timeout / etc. (cli.py:425-465) and calls uvicorn.run() itself — it does not call server.main(). The new flag is not wired into the serve subparser (~cli.py:3060) and is never copied onto the singleton from cli.py. So:

rapid-mlx serve qwen3.5-4b --enable-on-demand-loading → argparse error "unrecognized arguments"
Even after fixing argparse, serve_command() never sets cfg.enable_on_demand_loading, so ensure_model_loaded() always returns early.

Please wire the flag through cli.py's serve_parser and serve_command. The pattern to mirror is --default-temperature / --default-top-p (already plumbed both ways).

P0-2. swap_to_model does not drain in-flight requests before tearing down the old engine.
server.py:677 acquires _swap_lock and checks _is_model_loaded(model_name) — that protects swap-vs-swap but not swap-vs-request. In single-model mode the sequence is _model_registry.remove(old_name) → await old_engine.stop() (which sets self._loaded=False, drops _model, _engine). Any coroutine currently iterating engine.stream_generate(...) on the old engine will dereference None mid-stream and the client sees a torn SSE response. There's also a window between await old_engine.stop() and load_model(new) where the module global _engine is still pointing at a half-destroyed object — new requests landing there get an opaque AttributeError instead of a clean 503.

Mitigation needs an inflight-request counter (or engine.drain()) before stop(), and _engine = None set explicitly during the window so get_engine() returns 503 cleanly.

P1 — should fix before merge

P1-1. No validation on request.model before it's handed to load_model() → arbitrary HF download / disk-fill DoS.
chat.py:150 and completions.py:45 do await ensure_model_loaded(request.model) before _validate_model_name(request.model). With the flag on, an attacker-controlled string ("evil-attacker/backdoor", "any-public-repo-with-100GB-weights") reaches swap_to_model → load_model(model_name) → huggingface_hub.snapshot_download(...). No HF-repo-name regex, no disk-space precheck, no allowlist. Even with --api-key gating the endpoint, a single authenticated client can fill disk. Need at minimum: regex check ^[\\w\\-./]+/[\\w\\-.]+$, the existing _check_disk_space() call (cli.py:41), and an optional RAPID_MLX_ALLOWED_REPOS env var.

P1-2. _cached_mlx_models() walks the filesystem on every /v1/models request.
routes/models.py:16-43 — no caching, double for walks every snapshot dir. With 50+ cached models this becomes a per-request os.stat() storm an unauthenticated attacker can amplify by polling /v1/models. Add a 30s mtime-keyed lru_cache.

P1-3. Model-type filter is fragile in both directions.
routes/models.py:33 substring set (\"tts\", \"whisper\", \"asr\", \"sentence-transformer\", \"embed\"). Misses: kokoro (TTS), parakeet (ASR), bge-*, e5-*, gte-*, wav2vec, seamless. False positives: Qwen2.5-Embedded-…, anything with tts in the name that's actually a base LM. Detection should look at config.json's architectures field, not substring-match repo names. (We have is_mllm_model() in api/utils.py as a precedent.)

P1-4. The _sync_config double-import workaround leaves module-globals out of sync with the singleton forever.
server.py:725-731 comment is accurate — python -m vllm_mlx.server does create a __main__ instance separate from vllm_mlx.server. But the chosen fix (write to get_config() directly and exclude this field from _sync_config) means _enable_on_demand_loading (module global, used at server.py:1051 for logging) and cfg.enable_on_demand_loading (singleton, used by ensure_model_loaded) can disagree forever. The robust fix is to never define the flag as a module global — store it only on the singleton. Also: cli.py doesn't go through python -m vllm_mlx.server at all, so the workaround is for an entrypoint that isn't the documented one.

P2 — nits

P2-1. No tests added. Per project SOP §3: "every new behavior MUST have a new test". The state machine introduced here has multiple new behaviors that need pinning:

lock contention (two concurrent swaps coalesce, second acquirer returns fast after _is_model_loaded recheck)
_loading_model cleared in finally so swap is retryable after load_model failure
single-model vs registry branch
_is_model_loaded handles all of {model_name, model_alias, model_path, \"default\"}
ensure_model_loaded returns early when flag is off (this is the production default path)

P2-2. Apple-Silicon memory hygiene. await old_engine.stop() does not gc.collect() or mx.clear_cache(). On a 64GB box swapping from a 35GB model, the new load can race against unreclaimed Metal allocator buffers. Either add an explicit cleanup, or document why it's unnecessary.

P2-3. Supply-chain audit (per SOP §2.5): clean. No new deps, no workflow changes, no install hooks, no subprocess/urllib.request at import time. Network surface is contained to the existing huggingface_hub.snapshot_download path. The risk is P1-1 (input validation on the download trigger), not new infrastructure.

Summary

The motivation is great — Ollama-style auto-load is exactly the kind of UX win we want. But P0-1 means the feature isn't actually reachable from the supported entrypoint as written, and P0-2 means the swap path is unsafe under concurrent traffic (which is the realistic usage). P1-1 is a meaningful security gap that the flag-off default doesn't help users who legitimately enable the feature.

Marking as request-changes. Happy to discuss the design — particularly for P0-2, whether you'd prefer a drain-counter or a BatchedEngine.drain() API addition. The PR after these fixes will be very close to mergeable.

— Generated with Claude Code (multi-pass adversarial review)

raullenchai

See review findings comment above. 2 P0 (CLI unreachable + concurrent-request safety) block merge; P1-1 (input validation) is security-meaningful. Happy to discuss the design.

raullenchai · 2026-05-13T22:12:17Z

Supplementary findings (pr_validate / DeepSeek V4 Pro pass)

Re-ran a second independent review via pr_validate (DeepSeek V4 Pro). Confirms the core findings above; surfaces 2 edge cases worth fixing alongside the P0s:

P1-5. _cached_mlx_models() is not exception-safe (routes/models.py:23-38).
The iterdir() walks have no try/except. A single broken symlink, permission error, or partially-downloaded snapshot raises OSError / PermissionError and takes down /v1/models entirely (500 instead of returning the entries that did parse). Wrap both the outer iterdir and the inner snapshot walks in try: except OSError: continue.

P1-6. swap_to_model crashes with TypeError: object of type 'NoneType' has no len() if _model_registry is None (server.py:~651).
The branch is registry_mode = len(_model_registry) > 1. _model_registry is currently always initialized to a ModelRegistry instance at module import, but if the swap ever runs before that initialization (during a partially-bootstrapped startup, or after a future refactor) it crashes. Cheap guard: registry_mode = _model_registry is not None and len(_model_registry) > 1.

Both are easy fixes that should land in the same revision as the P0s.

(Tools used: pr_validate pipeline → deepseek-v4-pro review on the full diff. The stress_e2e_bench step also ran but failed for an unrelated infrastructure reason — port 8451 collision on my local box. That's not your PR.)

raullenchai requested changes May 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: on-demand model loading for all inference endpoints (ollama-style)#340

feat: on-demand model loading for all inference endpoints (ollama-style)#340
young310 wants to merge 2 commits into
raullenchai:mainfrom
young310:feat/on-demand-model-loading

young310 commented May 9, 2026 •

edited

Loading

Uh oh!

raullenchai commented May 10, 2026

Uh oh!

young310 commented May 11, 2026

Uh oh!

young310 commented May 11, 2026

Uh oh!

raullenchai commented May 13, 2026

Uh oh!

raullenchai left a comment

Uh oh!

raullenchai commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

young310 commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Core helpers (service/helpers.py)

New functions in server.py

Feature gate (--enable-on-demand-loading)

/v1/models now lists all local cache (routes/models.py)

Anthropic route (routes/anthropic.py)

Bug fixed: __main__ module aliasing

Behaviour

Testing

Uh oh!

raullenchai commented May 10, 2026

P0 — blocker (PR is non-functional)

vllm_mlx.server does not export swap_to_model or get_loading_model

P0 — blocker (security / cost)

Auto-loading runs before _validate_model_name

P1 — should fix

P2 — nits

Verdict

Uh oh!

young310 commented May 11, 2026

Code review

Uh oh!

young310 commented May 11, 2026

Uh oh!

raullenchai commented May 13, 2026

Review — PR Merge SOP audit

P0 — blocking merge

P1 — should fix before merge

P2 — nits

Summary

Uh oh!

raullenchai left a comment

Choose a reason for hiding this comment

Uh oh!

raullenchai commented May 13, 2026

Supplementary findings (pr_validate / DeepSeek V4 Pro pass)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

young310 commented May 9, 2026 •

edited

Loading

Core helpers (`service/helpers.py`)

New functions in `server.py`

Feature gate (`--enable-on-demand-loading`)

`/v1/models` now lists all local cache (`routes/models.py`)

Anthropic route (`routes/anthropic.py`)

Bug fixed: `main` module aliasing

`vllm_mlx.server` does not export `swap_to_model` or `get_loading_model`

Auto-loading runs before `_validate_model_name`